Robust Tools and Services for Long-Term Preservation of Digital Information

Authors

  • Joseph JáJá
  • Sangchul Song
Abstract

An unprecedented amount of digital information, appearing on a daily basis, needs to be archived and preserved over long time periods. Such information covers major facets of human activities such as business exchanges and electronic commerce, cultural and social interactions, e-government and legal proceedings, scientific studies and data collections, and even personal data such as digital photos and videos. It has been widely recognized that digital preservation is in general a very challenging process that requires innovations in institutional and business models, technology infrastructure, and social and legal frameworks. In this paper, we report on some of the core archiving and preservation tools and services that we developed under a general technology framework called ADAPT (Approach to Digital Archiving and Preservation Technology). The ADAPT model is based on a layered, digital object architecture that includes a set of modular tools and services built using open standards and Web technologies. These tools are designed so that they can easily accommodate new standards and policies while gracefully adapting to the underlying technologies as they evolve. In particular, we briefly describe our tools to (1) proactively audit and ensure data integrity over the lifetime of an archived digital object, (2) enable compact storage of and fast access to large-scale Web archives, and (3) manage ingestion workflows under a wide variety of environments. Most of these tools are currently being used to support the digital preservation environment for content from NDIIPP institutions through the Chronopolis project.

Robust Tools and Services for Long-Term Preservation of Digital Information, Joseph JaJa and Sangchul Song. LIBRARY TRENDS, Vol. 57, No. 3, Winter 2009 (“The Library of Congress National Digital Information Infrastructure and Preservation Program,” edited by Patricia Cruse and Beth Sandore), pp. 580–594. © 2009 The Board of Trustees, University of Illinois

Introduction

A large portion of the scientific, business, cultural, and government digital information being created today needs to be maintained and preserved for future use over periods ranging from a few years to decades and sometimes centuries. Since the mid-nineties, the issue of long-term preservation of digital information has received considerable attention from major archiving communities, library organizations, government agencies, scientific communities, and individual researchers. These studies have identified major challenges regarding institutional and business models, technology infrastructure, and social and legal frameworks, which need to be addressed to achieve long-term reliable archiving of and access to digital information. Selected references that cover some of these findings are (Hedstrom, 2002; Hedstrom et al., 2003; Thibodeau, 2002).

Focusing on the technology component, we note that a significant number of initiatives have been set up to develop technology prototypes to tackle some aspects of this problem. These initiatives include the Internet Archive (Kahle, 1997), the National Library of Australia’s PANDORA project (n.d.), LOCKSS (Maniatis et al., 2005), the Transcontinental Persistent Archive Prototype (TPAP) (Moore et al., 2003), the Universal Virtual Computer (Lorie, 2002), the Electronic Records Archives program at the National Archives (National Archives and Records Administration, n.d.), and the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP) (The National Digital Information Infrastructure and Preservation Program, the Library of Congress).

The traditional archiving and preservation approach has been a distributed activity in which each organization maintains and preserves its holdings with relatively little sharing.
Such an approach is based on well-understood and proven processes for archiving and preserving physical holdings, which have been refined over the years. On the other hand, digital preservation is a very recent activity that is faced with a major technology challenge due in part to the large amount of important digital information generated on a daily basis, the fast pace of technology evolution, and the relative fragility of digital information and computing infrastructure. As a result, it appears that systematic methodologies are needed to address the following key requirements:

  • Encapsulation of information regarding content, structure, context, provenance, and access within each digital object to enable the long-term maintenance and life cycle management of the digital object.
  • Efficient management of technology evolution, both hardware and software, and the appropriate handling of technology obsolescence (for example, format obsolescence).
  • Efficient risk management and disaster recovery mechanisms to cope with technology degradation and failure, natural disasters such as fires, floods, and hurricanes, human-induced operational errors, and security failures and breaches.
  • Efficient proactive mechanisms to ensure the authenticity and integrity of content, context, and structure of archived information throughout the preservation period.
  • Ability for information discovery and content access and presentation, with an automatic enforcement of authorization and IP rights, throughout the life cycle of each object.
  • Scalability in terms of ingestion rate, capacity, and processing power to manage and preserve large-scale heterogeneous collections of complex objects, and in terms of the speed at which users can discover and retrieve information.
  • Ability to accommodate possible changes over time in organizational structures and stewardships, relocation, and repurposing.

The reports (Hedstrom, 2002; Hedstrom et al., 2003), while relatively old, give a good summary of the main technology challenges facing long-term digital preservation and archiving.

In this paper, we present an overview of a number of our tools that were designed to address several of the requirements listed above and that are currently in use by the Chronopolis preservation environment. The Chronopolis project, an effort supported by the National Digital Information Infrastructure and Preservation Program (NDIIPP), offers a distributed data grid architecture with storage located at the University of Maryland, the San Diego Supercomputer Center (SDSC), and the National Center for Atmospheric Research (NCAR). The main goal of Chronopolis is to provide long-term archiving and preservation services for content coming from NDIIPP partners. Initial content has been provided by the California Digital Library (CDL) and the Inter-university Consortium for Political and Social Research (ICPSR).

The ADAPT Approach

Long-term preservation of digital information is a process that must begin before the data is ingested into an archival system and must remain continuously active throughout the life cycle management of the digital objects. In fact, an understanding of exactly what is being preserved and how to precisely incorporate such information is a critical step that must be completed before any ingestion can begin. While the traditional archiving processes of appraisal, accessioning, arrangement, description, preservation, access, and repurposing are well understood for archiving and preserving physical holdings, they are quite lacking in addressing digital preservation.

Our technology approach is based on a number of premises. The first premise is to capture properties of content, structure, context, presentation, and preservation within a digital object architecture, and enable the infrastructure to manage and preserve these objects.
The digital object must contain the essential features that encapsulate what is being preserved, and should include behavioral information about its life cycle management and preservation. An early work advocating a digital object architecture is Kahn and Wilensky (1995), which led to the development of the Handle system (Corporation for National Research Initiatives, n.d.) for assigning persistent global identifiers.

The second premise of our approach is to separate the archive’s management of the digital objects into three levels of abstraction, resulting in a well-defined three-layered architecture. The first layer, the data layer, is responsible for managing the bits representing the digital object across storage systems (evolving through both time and space). The second layer deals with the semantics of the data and the relationships between objects rather than with storage and bits. The third layer deals with services related to monitoring, preservation, and management policies.

Finally, we borrow considerably from the Reference Model for an Open Archival Information System (OAIS) (Consultative Committee for Space Data Systems, 2002), including its overall terminology. Briefly, this model consists of producers, an archive, and consumers, where the producers prepare and transfer data to the archive, which is responsible for managing the digital information for long-term preservation and for providing an interface to the consumers for accessing the information as needed. For each stage, OAIS provides a detailed model of the information, called respectively the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP).

The overall ADAPT model can be represented as shown in figure 1. Our efforts are aimed toward the development of tools and services in support of the components represented by the shaded boxes.
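
The three-layer separation described above can be illustrated with a minimal Python sketch. It is not the ADAPT implementation: the class names (DataLayer, MetadataLayer, ServiceLayer), the in-memory dictionaries, and the choice of a SHA-256 digest as the monitoring policy are all assumptions made here for illustration only. The sketch merely shows how the bits, the semantics, and the policies can be kept in separate components that refer to an object only through its persistent name.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class DataLayer:
    """Layer 1 (hypothetical): manages the bits of each object by name."""

    store: dict[str, bytes] = field(default_factory=dict)

    def put(self, name: str, bits: bytes) -> None:
        self.store[name] = bits

    def get(self, name: str) -> bytes:
        return self.store[name]


@dataclass
class MetadataLayer:
    """Layer 2 (hypothetical): semantics and relationships between objects."""

    descriptions: dict[str, dict[str, str]] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def describe(self, name: str, **properties: str) -> None:
        self.descriptions.setdefault(name, {}).update(properties)

    def relate(self, subject: str, predicate: str, target: str) -> None:
        self.relations.append((subject, predicate, target))


@dataclass
class ServiceLayer:
    """Layer 3 (hypothetical): monitoring and preservation policies."""

    data: DataLayer
    metadata: MetadataLayer

    def ingest(self, name: str, bits: bytes, **properties: str) -> None:
        # Assumed policy: record a SHA-256 digest at ingestion time.
        self.data.put(name, bits)
        digest = hashlib.sha256(bits).hexdigest()
        self.metadata.describe(name, sha256=digest, **properties)

    def audit(self, name: str) -> bool:
        # Recompute the digest of the stored bits and compare it with
        # the digest recorded in the metadata layer.
        recorded = self.metadata.descriptions[name]["sha256"]
        return hashlib.sha256(self.data.get(name)).hexdigest() == recorded


# Usage with made-up identifiers.
archive = ServiceLayer(DataLayer(), MetadataLayer())
archive.ingest("obj:1", b"survey data", title="Survey")
archive.metadata.relate("obj:1", "partOf", "collection:1")
print(archive.audit("obj:1"))  # True while the stored bits are unchanged
```

Because the service layer touches the other two layers only through their put, get, and describe methods, either layer could in principle be replaced by a distributed store without changing the policy code, which is the point of the separation.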
Thus far, our team has developed tools and services to handle the ingestion workflow and some aspects of the preservation, search, and access services. These tools are independent of the architecture of the data or the metadata layer, and will work with either centralized or distributed infrastructure. Our only assumptions regarding these two layers are that (1) each digital object has a unique persistent name; and (2) the data layer maintains more than a single copy of each digital object, one of which is designated as the master copy. Otherwise our tools are completely platform-independent and will easily interoperate with any archive using the


Similar resources

RODA and CRiB: A Service-Oriented Digital Repository

In 2006 the Portuguese National Archives (Directorate-General of the Portuguese Archives) engaged in the development of an OAIS compatible digital repository system for long-term preservation of digital material. Simultaneously, at the University of Minho a project called CRiB was being devised which aimed at the development of a wholesome set of services to aid digital preservation. Among those...


Hoppla - Digital Preservation Support for Small Institutions

Small businesses (small office/home office, SOHO) have tremendous amounts of digital information. At the same time, they have little to no expertise on how to manage it, not to mention caring for its long-term preservation, as even simple back-up strategies already pose drastic challenges. This demo presents the Hoppla archiving system to provide digital preservation solutions specifically fo...


'Digital Preservation: The Planets Way': Outreach And Training For Digital Preservation Using Planets Tools And Services

This paper outlines the Europe-wide programme of outreach and training events, jointly organised by HATII at the University of Glasgow and the British Library, in collaboration with a number of European partner institutions, on behalf of the Planets project (Preservation and Long Term Access Through Networked Services) between June 2009 and April 2010. It describes the background to the program...


Towards a Definition of Digital Information Preservation Object

In this paper we discuss long-term digital preservation from an information perspective, rather than from the predominant approaches: the Archival and the Technocratic Approach. Our standpoint is that information is at the core of long-term digital preservation. This means that information is the object of preservation in the long term. Information lives longer than people, organizations and tools (...


Planets - Preservation and Long-term Access through NETworked Services Outreach and Training events 2009

For the first time, the amount of digital information produced exceeds the storage space available. Naturally, not all of this information has to be maintained for the long term, but unarguably more and more born-digital information will be created and attention needs to be paid to preservation for the long-term. Unlike information stored on physical media such as stone, paper or parchment, inf...



Journal title:
  • Library Trends

Volume 57  Issue 

Pages  -

Publication date 2009